fix: SQLite survival hardening — anomaly-store memory blowup + security/CI quick-wins #97
Open
aksOps wants to merge 2 commits into
- security(api): close cross-tenant read caused by middleware ordering. TenantMiddleware now passes through when auth already pinned a tenant (HasTenantContext), so a per-tenant key can't be escaped via X-Tenant-ID
- fix(ingest): correct token-bucket sampler math; the old cost (1/rate) exceeded the cap for rate<1.0, so ~100% of healthy spans were dropped (SQLite default 0.05 persisted almost no baseline traces)
- fix(api): clamp limit/offset on /api/logs and /api/traces (a negative limit was passed to GORM as unlimited: heap/DB DoS)
- fix(ingest): sanitize X-Tenant-ID on the HTTP OTLP path (gRPC parity)
- fix(mcp): don't cache error tool results; enforce the response byte cap in resourceResult (trace_graph DB fallback was uncapped)
- fix(ui): correct ServiceSidePanel test for split design-system markup, mount ErrorBoundary, derive connected badge from ws.status
- chore: bump go directive to 1.25.11 to unblock the OSV-Scanner CI gate

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
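The limit/offset clamp in the fix(api) item can be illustrated with a minimal sketch. This is not the PR's code: the helper name `clampPage` and the bounds `defaultLimit` and `maxLimit` are hypothetical, chosen only to show the shape of the guard that keeps a negative limit from reaching the query layer.

```go
package main

import (
	"fmt"
	"strconv"
)

// Hypothetical bounds; the PR does not state the values it uses.
const (
	defaultLimit = 100
	maxLimit     = 1000
)

// clampPage parses raw query-string values and returns a limit in
// [1, maxLimit] and an offset >= 0. Missing, malformed, or non-positive
// limits fall back to defaultLimit; bad or negative offsets become 0.
func clampPage(rawLimit, rawOffset string) (limit, offset int) {
	limit = defaultLimit
	if n, err := strconv.Atoi(rawLimit); err == nil && n > 0 {
		limit = n
	}
	if limit > maxLimit {
		limit = maxLimit
	}
	if n, err := strconv.Atoi(rawOffset); err == nil && n > 0 {
		offset = n
	}
	return limit, offset
}

func main() {
	fmt.Println(clampPage("-1", "-5"))   // negative inputs fall back to safe values
	fmt.Println(clampPage("5000", "10")) // oversized limit is capped
	fmt.Println(clampPage("25", ""))     // in-range limit passes through
}
```

Clamping at the handler boundary, before the values are handed to the ORM, means the query layer never sees a value it could interpret as "no limit".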
A 15-min SQLite soak at 120 services drove RSS to ~1.8 GB and climbing. Heap profiling (gc=1) attributed 84% of the live heap to AnomalyStore PRECEDED_BY edges: the 10s detector minted a NEW anomaly node every tick per erroring service (UnixNano-suffixed ID), and correlateWithRecent then created O(N^2) edges among them, unbounded until the 24h TTL.

- fix: stable per-(service,type) anomaly IDs so detection UPSERTS one evolving node instead of one-per-tick; this bounds both the node map and the edge mesh (AnomalyStore 272 MB -> 2.6 MB; peak RSS 1.8 GB -> 292 MB, now flat over the full 15 min). + regression test.
- feat: applyMemoryLimit() sets a soft GOMEMLIMIT at startup (honors an explicit env value, else 75% of the detected cgroup/host budget) so the GC paces against a ceiling instead of letting next_gc run away. Defense in depth; cgroup v2/v1 + /proc/meminfo detection, stdlib-only. + tests.

Validation: 3x 15-min soaks + heap profile; integrity ok, 0 drops/429s, 0 ERROR/panic, clean shutdown, goroutines/fds recover, 30k spans/120 svcs.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
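The stable-ID idea can be sketched as follows. This is an illustration under assumptions, not the PR's implementation: the `Anomaly` and `store` types, their fields, and the helper names are hypothetical; only the ID shape (the old `anom_<svc>_err_<UnixNano>` form with the timestamp suffix dropped) follows the description above.

```go
package main

import (
	"fmt"
	"time"
)

// Anomaly is a hypothetical node type for this sketch.
type Anomaly struct {
	ID        string
	FirstSeen time.Time
	LastSeen  time.Time
	Ticks     int // number of detector ticks that re-confirmed this anomaly
}

// store is a hypothetical stand-in for the anomaly node map.
type store struct {
	nodes map[string]*Anomaly
}

// stableID depends only on (service, type), never on the tick time,
// so repeated detections map to the same key.
func stableID(service, typ string) string {
	return "anom_" + service + "_" + typ
}

// upsert updates the existing node for (service, typ) or creates it once.
func (s *store) upsert(service, typ string, now time.Time) *Anomaly {
	id := stableID(service, typ)
	if a, ok := s.nodes[id]; ok {
		a.LastSeen = now
		a.Ticks++
		return a
	}
	a := &Anomaly{ID: id, FirstSeen: now, LastSeen: now, Ticks: 1}
	s.nodes[id] = a
	return a
}

// simulate runs `ticks` detector ticks, 10 s apart, over the given services.
func simulate(services []string, ticks int) *store {
	s := &store{nodes: make(map[string]*Anomaly)}
	start := time.Unix(0, 0)
	for i := 0; i < ticks; i++ {
		now := start.Add(time.Duration(i) * 10 * time.Second)
		for _, svc := range services {
			s.upsert(svc, "err", now)
		}
	}
	return s
}

func main() {
	// 90 ticks at 10 s is a 15-minute window.
	s := simulate([]string{"cart", "checkout", "search"}, 90)
	fmt.Println("nodes:", len(s.nodes))
}
```

With a timestamp-suffixed ID the same run would hold one node per service per tick, and any pairwise correlation among those nodes would grow quadratically; with a stable key the node count is bounded by the number of distinct (service, type) pairs.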



Summary
Hardens the SQLite-survival profile after a 15-minute, 120-service soak surfaced a serious in-memory blowup, plus a batch of security / CI-unblock / reliability fixes. Two atomic commits.
fix(graphrag): bound anomaly-store memory + GOMEMLIMIT safety net
A 15-min SQLite soak at 120 services drove RSS to ~1.8 GB and climbing. Heap profiling (gc=1, true live set) attributed 84% of the live heap to AnomalyStore PRECEDED_BY edges: the 10s detector minted a new anomaly node every tick per erroring service (anom_<svc>_err_<UnixNano>), and correlateWithRecent then created O(N²) edges among them, unbounded until the 24h TTL.
- Stable per-(service,type) anomaly IDs, so detection upserts one evolving node instead of one per tick. AnomalyStore 272 MB → 2.6 MB; peak RSS 1.8 GB → 292 MB, now flat over the full 15 min. Includes a regression test.
- applyMemoryLimit() sets a soft GOMEMLIMIT at startup (honors an explicit env value, else 75% of the detected cgroup v2/v1 → /proc/meminfo budget) so the GC paces against a ceiling instead of letting next_gc run away. Defense-in-depth; stdlib-only.

fix: security, CI-unblock, and reliability quick-wins
- Cross-tenant read: TenantMiddleware no longer overwrites an auth-pinned tenant (a per-tenant key could be escaped via X-Tenant-ID).
- Token-bucket sampler math: cost 1/rate > cap for rate<1.0 → ~100% of healthy spans dropped; SQLite default 0.05 persisted almost no baseline traces.
- Clamp limit/offset on /api/logs & /api/traces (negative limit → GORM unlimited = DoS).
- Sanitize X-Tenant-ID on the HTTP OTLP path (gRPC parity).
- MCP: don't cache error tool results; enforce the response byte cap in resourceResult.
- UI: fix the ServiceSidePanel test (split DS markup), mount ErrorBoundary, derive the connected badge from ws.status.
- Bump the go directive to 1.25.11 to unblock the OSV-Scanner CI gate.

Validation
- 3x 15-min soaks + heap profile: PRAGMA integrity_check = ok; 0 drops/429s; 0 ERROR/panic; clean shutdown; goroutines/fds recover to baseline; 30,285 spans / 120 services persisted.
- go build ./..., go vet ./..., gofmt, go test ./... (pass), golangci-lint (clean on changed files), osv-scanner (green).

Storage note (7-day rolling retention)
At the tested profile: ~3.3 GB/day → ~25 GB steady-state on disk for 120 services. ~1.2 KB per persisted span (incl. trace row + indexes + any error log); ~50% data / 50% indexes. Dominated by the synthetic 4.2% error rate (errors are always kept); realistic <1% error rates land closer to ~10–20 GB. Beyond that band, Postgres is the recommended path (per the existing main.go warning).

Known limitations / follow-ups
- main.go:625 G115 lint finding is untouched.

🤖 Generated with Claude Code
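For readers who want a feel for the GOMEMLIMIT safety net described in the summary, here is a rough, stdlib-only sketch. It is not the PR's applyMemoryLimit(): the function names, the file paths probed, and the "treat very large cgroup values as unlimited" threshold are assumptions made for illustration. Only the overall policy (leave an explicit GOMEMLIMIT alone, otherwise use 75% of the detected budget, falling back from cgroup v2 to v1 to /proc/meminfo) comes from the description.

```go
package main

import (
	"fmt"
	"os"
	"runtime/debug"
	"strconv"
	"strings"
)

// parseLimit parses a cgroup memory-limit file body. It reports false for
// "max" (cgroup v2 unlimited), garbage, non-positive values, and values so
// large they effectively mean "unlimited" (assumed threshold: 1<<62).
func parseLimit(s string) (int64, bool) {
	s = strings.TrimSpace(s)
	if s == "" || s == "max" {
		return 0, false
	}
	n, err := strconv.ParseInt(s, 10, 64)
	if err != nil || n <= 0 || n >= 1<<62 {
		return 0, false
	}
	return n, true
}

// parseMemTotal extracts MemTotal (reported in kB) from /proc/meminfo text
// and returns it in bytes.
func parseMemTotal(meminfo string) (int64, bool) {
	for _, line := range strings.Split(meminfo, "\n") {
		f := strings.Fields(line)
		if len(f) >= 2 && f[0] == "MemTotal:" {
			kb, err := strconv.ParseInt(f[1], 10, 64)
			if err != nil || kb <= 0 {
				return 0, false
			}
			return kb * 1024, true
		}
	}
	return 0, false
}

// softLimit returns 75% of the budget, dividing first to avoid overflow.
func softLimit(budget int64) int64 { return budget / 4 * 3 }

// detectBudget tries cgroup v2, then cgroup v1, then host memory.
func detectBudget() (int64, bool) {
	paths := []string{
		"/sys/fs/cgroup/memory.max",                   // cgroup v2
		"/sys/fs/cgroup/memory/memory.limit_in_bytes", // cgroup v1
	}
	for _, p := range paths {
		if raw, err := os.ReadFile(p); err == nil {
			if n, ok := parseLimit(string(raw)); ok {
				return n, true
			}
		}
	}
	if raw, err := os.ReadFile("/proc/meminfo"); err == nil {
		return parseMemTotal(string(raw))
	}
	return 0, false
}

// applyMemoryLimitSketch sets a soft limit unless GOMEMLIMIT is already set
// (the Go runtime reads that variable itself) or no budget can be detected.
func applyMemoryLimitSketch() (int64, bool) {
	if os.Getenv("GOMEMLIMIT") != "" {
		return 0, false
	}
	budget, ok := detectBudget()
	if !ok {
		return 0, false
	}
	limit := softLimit(budget)
	debug.SetMemoryLimit(limit)
	return limit, true
}

func main() {
	if limit, ok := applyMemoryLimitSketch(); ok {
		fmt.Printf("soft memory limit set to %d bytes\n", limit)
	} else {
		fmt.Println("memory limit left unchanged (GOMEMLIMIT set or no budget detected)")
	}
}
```

The limit set this way is soft: the runtime works harder to stay under it but does not abort when it is exceeded, which is why it complements the anomaly-store fix rather than replacing it.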